47 research outputs found

    Improving the Corrosion Resistance of a High-Strength Aluminum-Copper Casting Alloy

    The B206 alloy (up to 5 wt% Cu) is the strongest aluminum foundry alloy in current use. It can be used in a number of automotive applications, e.g., suspension knuckles and vehicle control arms, to reduce vehicle weight. Elimination of hot tearing has reawakened interest in the 206 alloy family. However, the B206 alloy is susceptible to intergranular and pitting corrosion, which restricts its current applications. A heterogeneous distribution of Cu-containing intermetallic precipitates in the as-cast condition resulted in severe intergranular corrosion. The improved 3-step ST + 2-step AA treatment provides better corrosion resistance than the 2-step ST + 1-step AA treatment. A longer first-step AA time eliminated intergranular corrosion but resulted in low-level pitting corrosion. The elongation was found to decrease as AA temperature and time increased. It is difficult to obtain both excellent corrosion resistance and elongation (≥10%) in the overaged condition.

    Ferritic Nitrocarburizing of Plain Carbon Steels

    Nitrocarburizing is a case hardening process which improves the hardness and wear resistance of a component, but results in geometric distortions. Torque converter pistons (automotive component) and Navy C-ring specimens (measuring tool) made from SAE 1010 plain carbon steel were used for the distortion analysis. Navy C-rings are generally used for studying the dimensional changes of a material before analyzing the dimensional changes of the actual component. Navy C-rings of varying thicknesses were used to analyze how distortion changes with thickness. Finite element simulations of Navy C-rings and torque converter pistons were developed to study the effect of the nitrocarburizing process on distortions. For thinner specimens, the predicted distortions compare favorably with the experimental values. However, the thicker C-ring specimens showed high prediction error. To better understand the prediction difference between the different C-rings and TC pistons, an empirical ratio of bulk volume to nitrocarburized volume (V/VN) was introduced. The V/VN ratio not only helps to separate the nitrocarburized-surface-dominant and bulk-dominant specimens, but also provides a better comparison of the C-ring size with the actual component. Reducing the V/VN ratio led to a decrease in both the C-ring’s inner diameter (ID) and gap width (GW) distortion, and a small increase in outer diameter (OD) distortion. A composition-depth profile simulation model was also developed to predict the local distortion due to nitrocarburized phases. The local distortion results showed that the γ′-phase (Fe4N) in the diffusion zone dominates the overall magnitude of distortion. The residual stress distribution for the 1-step nitrocarburizing treatment was successfully modeled and validated against the experimental stress. The surface stress for the 1-step nitrocarburizing treatment was found to be tensile in nature. This tensile surface residual stress was further reduced by introducing 2-step nitrocarburizing treatments. Using the nitrogen profile model for 2-step nitrocarburizing, it was found that the additional γ′-Fe4N phase at the surface resulted in a notable reduction in tensile residual stress. Also, the simulated 2-step nitrocarburizing treatments produced the same level of distortion as the 1-step nitrocarburizing treatments. Therefore, the proposed 2-step nitrocarburizing treatments could potentially improve both the surface quality and fatigue resistance.

    Poster: implications of merging phases on scalability of multi-core architectures

    Amdahl's Law predicts that parallel applications with negligible serial sections can potentially scale to many cores. However, because of merging phases in data mining applications, the serial sections do not remain constant. We extend Amdahl's model to accommodate this and establish that Amdahl's Law can overestimate the scalability offered by symmetric and asymmetric architectures for such applications. Implications: 1) chip area is better spent on fewer, and hence more capable, cores than on simply increasing the number of cores, for both symmetric and asymmetric architectures; and 2) the performance potential of asymmetric over symmetric multi-core architectures is limited for such applications.
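
The effect described in this abstract can be illustrated with a small sketch. The abstract does not say how the merging phase grows, so the linear growth of merge cost with core count below is purely an assumption for illustration, and the names `amdahl_speedup`, `merging_speedup`, and `merge_cost_per_core` are hypothetical rather than taken from the poster.

```python
def amdahl_speedup(serial_fraction, cores):
    """Classic Amdahl's Law: the serial fraction is constant."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def merging_speedup(serial_fraction, merge_cost_per_core, cores):
    """Amdahl-style speedup with a merging phase added to the serial part.

    ASSUMPTION (not from the source): the merge phase costs
    merge_cost_per_core units of normalized time for every core
    beyond the first, i.e. it grows linearly with the core count.
    """
    merge_time = merge_cost_per_core * (cores - 1)
    return 1.0 / (serial_fraction + merge_time + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(n, amdahl_speedup(0.0, n), merging_speedup(0.0, 0.001, n))
```

Under this assumed cost model, the second function peaks at a moderate core count and then declines, while classic Amdahl's Law keeps predicting gains. That is one way to read the abstract's claim that fewer, more capable cores can be the better use of chip area.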

    ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes

    Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is still challenging, particularly when using asymmetric architectures with different types of CPU cores. A common approach for energy savings involves dynamic voltage and frequency scaling (DVFS), wherein throttling is carried out based on factors like task parallelism, stealing relations, and task criticality. This article makes the following observations: (i) leveraging DVFS on a per-task basis is impractical when using fine-grained tasking and in environments with cluster/chip-level DVFS; (ii) task moldability, wherein a single task can execute on multiple threads/cores via work-sharing, can help to reduce energy consumption; and (iii) a mismatch between tasks and assigned resources (i.e., core type and number of cores) can detrimentally impact energy consumption. In this article, we propose EneRgy Aware SchedulEr (ERASE), an intra-application task scheduler on top of work stealing runtimes that aims to reduce the total energy consumption of parallel applications. It achieves energy savings by guiding scheduling decisions based on per-task energy consumption predictions for different resource configurations. In addition, ERASE is capable of adapting to both given static frequency settings and externally controlled DVFS. Overall, ERASE achieves up to 31% energy savings and improves performance by 44% on average, compared to the state-of-the-art DVFS-based schedulers.
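
As a rough illustration of guiding a scheduling decision by per-task energy predictions, the sketch below picks the resource configuration (core type and number of cores) with the lowest predicted energy. It is not ERASE's implementation: the `Config` type, the `pick_config` function, the prediction table, and the energy model (predicted time multiplied by predicted power) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical resource configuration for one moldable task."""
    core_type: str   # e.g. "big" or "little" on an asymmetric chip
    num_cores: int   # cores the task is spread across via work-sharing


def pick_config(predictions):
    """Return the configuration with the lowest predicted energy.

    predictions maps Config -> (predicted_time_s, predicted_power_w).
    ASSUMPTION: energy is modeled simply as time * power.
    """
    return min(predictions, key=lambda c: predictions[c][0] * predictions[c][1])


if __name__ == "__main__":
    # Made-up numbers for a single task, for illustration only.
    table = {
        Config("big", 1): (1.0, 4.0),
        Config("little", 1): (2.8, 0.6),
        Config("little", 2): (1.5, 1.0),
    }
    print(pick_config(table))
```

With these made-up numbers, running the task on two little cores is predicted to use less energy than either single-core option, which mirrors observations (ii) and (iii) in the abstract.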

    Coordinated management of DVFS and cache partitioning under QoS constraints to save energy in multi-core systems

    Reducing the energy expended to carry out a computational task is important. In this work, we explore the prospects of meeting Quality-of-Service (QoS) requirements of tasks on a multi-core system while adjusting resources to expend a minimum of energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache (LLC) partitions and the per-core voltage–frequency settings to save energy while respecting the QoS requirements of every application in multi-programmed workloads run on multi-core systems. It does so by exploring, at runtime and at negligible overhead, the configuration space spanned by LLC partition sizes and Dynamic Voltage–Frequency Scaling (DVFS) settings. We show that the energy of 4-core and 8-core systems can be reduced by up to 18% and 14%, respectively, compared to a baseline with even distribution of cache resources and a fixed mid-range core voltage–frequency setting. The energy savings can potentially reach 29% if the QoS targets are relaxed to 40% longer execution time.
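
The idea of searching cache-partition and frequency settings under QoS constraints can be sketched as a brute-force search. This is not the paper's RMA, which the abstract says runs at negligible overhead; the function `best_allocation`, the per-application lookup tables, and the deadlines below are hypothetical and exist only to show the shape of the optimization problem.

```python
from itertools import product


def best_allocation(apps, total_ways):
    """Exhaustively search per-app (cache_ways, freq_ghz) settings.

    apps is a list of (options, deadline_s) pairs, where options maps
    (cache_ways, freq_ghz) -> (time_s, energy_j). These tables are
    ASSUMED inputs for illustration, not data from the paper.
    Returns (total_energy, [config per app]) or None if infeasible.
    """
    # Keep only the settings that meet each application's QoS deadline.
    feasible = [
        [(cfg, energy) for cfg, (time, energy) in options.items() if time <= deadline]
        for options, deadline in apps
    ]
    best = None
    for combo in product(*feasible):
        # The LLC partitions must fit in the shared cache.
        if sum(cfg[0] for cfg, _ in combo) > total_ways:
            continue
        energy = sum(e for _, e in combo)
        if best is None or energy < best[0]:
            best = (energy, [cfg for cfg, _ in combo])
    return best


if __name__ == "__main__":
    app_a = ({(1, 1.0): (12.0, 5.0), (2, 1.0): (9.0, 6.0),
              (3, 1.0): (8.0, 6.5), (2, 2.0): (6.0, 9.0)}, 10.0)
    app_b = ({(1, 1.0): (11.0, 4.0), (1, 2.0): (7.0, 7.0),
              (2, 1.0): (8.0, 5.0), (3, 1.0): (7.5, 5.5)}, 10.0)
    print(best_allocation([app_a, app_b], total_ways=4))
    print(best_allocation([app_a, app_b], total_ways=3))
```

In this toy data, shrinking the cache from four ways to three forces one application onto a higher frequency to keep its deadline, raising total energy. That coupling between cache share and frequency is why the two resources are managed together.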

    Runtime-guided management of stacked DRAM memories in task parallel programs

    Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach, the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization to parallelize the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid copies to the stacked DRAM that are not worthwhile. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% against the state-of-the-art library to manage the stacked DRAM and 29% against a stacked DRAM architected as a hardware cache. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Union’s Horizon 2020 research and innovation programme (grant agreement 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Peer reviewed. Postprint (author's final draft).

    Towards Runtime-Assisted Cache Management for Task-Parallel Programs

    Architects have adopted the shared memory model that implicitly manages cache coherence and cache capacity in hardware, mainly to aid programmability of multi-core architectures. The hardware mechanisms are, however, prone to inefficiencies because they are not tailored to the behavior of individual parallel applications. Specifically, the manner in which sharing patterns are handled by the coherence protocol (e.g., MESI) leads to coherence inefficiencies, and the manner in which data access patterns are handled by the replacement policy (e.g., LRU) leads to capacity inefficiencies (due to dead blocks). The underlying strategy adopted by hardware-based proposals to address these inefficiencies is simple: glean information about sharing/access patterns by analyzing accesses to cache blocks and enable optimizations if conditions that lead to inefficiencies are detected. This thesis proposes new approaches that leverage rich information about parallel applications, available to the runtime system in task-based and task data-flow programming models, regarding tasks and their working sets, inter-task dependencies and mapping of tasks to cores, in order to effectively address inefficiencies introduced by hardware management of cache coherence and cache capacity. This thesis also establishes the utility of hardware-based proposals that address the coherence and cache capacity inefficiencies in the context of such parallel applications. This thesis makes the following contributions in addressing cache coherence and capacity inefficiencies. The thesis first proposes a forwarding scheme in hardware that tracks updates at the producer with low overhead and initiates forwarding after consumers issue the first request to access the updated data, in order to mitigate coherence overheads in producer-consumer sharing. A hybrid technique is then proposed that detects producer-consumer and migratory sharing patterns in the runtime. This information is then communicated to the cache coherence substrate to trigger the appropriate coherence optimization and mitigate coherence overheads due to these sharing patterns. As for cache capacity management, this thesis focuses on managing dead blocks, which have been shown to lead to inefficient utilization of cache capacity. This thesis proposes a novel technique that exploits information exchange across the runtime system and the architecture to detect dead blocks in the last-level cache more efficiently and signal them for eviction. Finally, this thesis leverages the outlook for future accesses to data provided by the runtime system to identify blocks that are dead in the entire hierarchy and evict them simultaneously. This approach to global management of dead blocks is shown to be beneficial over local approaches adopted by hardware-based proposals that predict dead blocks at each level individually.